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Abstract 

The paper describes several efficient parallel implementations of the one-sided hy- 
perbolic Jacobi-type algorithm for computing eigenvalues and eigenvectors of Hermitian 
matrices. By appropriate blocking of the algorithms an almost ideal load balancing 
between all available processors/cores is obtained. A similar blocking technique can be 
used to exploit local cache memory of each processor to further speed up the process. 
Due to diversity of modern computer architectures, each of the algorithms described 
here may be the method of choice for a particular hardware and a given matrix size. 
All proposed block algorithms compute the eigenvalues with relative accuracy similar 
to the original non-blocked Jacobi algorithm. 
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1 Introduction 

The Jacobi method is one of the simplest algorithms for diagonalization of symmetric matri- 
ces. For many years, during the era of predominantly sequential computing, the Jacobi-type 
algorithms were almost forgotten in practice, as they are too slow for serial computing, when 
compared with some faster diagonalization algorithms proposed since. 

But, in 1992. Demmel and Veselic's historic paper [TT] shed a new light onto the method. 
They proved that the Jacobi algorithm for positive definite matrices is accurate in the relative 
sense, which is the natural error model for standard floating-point computations. 

Let A be a symmetric positive definite matrix of order n with elements <2jj . If we introduce 
a perturbation Satj in each element of A, such that Saij is small in the relative sense with 
respect to ay, i.e., 

\Scnj \ < e\aij\, 
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then the eigenvalues A; of A, and A^ = Aj + SXi of A + SA, satisfy 



- < enK,(A s ), i = 1, . . . , n. 

A, 

Here, A s = (diag(A))- 1 /2^( diag ( J 4))-i/2 

is a special diagonal scaling of A, where diag(A) 
denotes a diagonal matrix containing the diagonal elements of A, and k(A s ) = ||j4J|2 H2 
is the scaled condition of A. It can also be shown that n(A s ) nearly minimizes the condition 
number over all possible diagonal scalings of A (see [33J for details). 

Almost at the same time, Veselic [33] has developed a Jacobi-type algorithm (sometimes 
called indefinite, hyperbolic or J- Jacobi algorithm) for diagonalization of definite matrix pairs 
{A, J), where A is symmetric positive definite, and J is a diagonal matrix of signs 1 and — 1. 
In contrast to the ordinary Jacobi method, which uses unitary congruences, the J- Jacobi 
method uses J-unitary congruences for diagonalization. This algorithm can also be used for 
diagonalization of indefinite symmetric matrices H (see Section 2). As a Ph.D. student of 
Veselic, Slapnicar [301 132"] has shown that the algorithm is accurate in the relative sense, as 
well. The same is true for various implementations of the J-Jacobi algorithm for complex 
Hermitian matrices [27] . 

Recently, Dopico, Koev and Molera [T^] have shown that the ordinary one-sided Jacobi 
algorithm can also be relatively accurate, if a symmetric, possibly indefinite, matrix H can 
be factorized as H = XDX T , where D is diagonal, and the factor X is well-conditioned. 
But, according to their own tests, the ordinary one-sided algorithm can have significantly 
more sweeps than the one-sided J-Jacobi algorithm. Moreover, when the eigenvectors of H 
are required, the J-Jacobi algorithm is much faster, even if both algorithms have the same 
number of sweeps. This is due to the fact that, after completion of the J-Jacobi process, 
the eigenvectors are obtained by simple scaling of the final matrix, while in the ordinary 
algorithm the eigenvectors are obtained by accumulation of transformations. Therefore, the 
J-Jacobi algorithm seems to be the method of choice for accurate computation of eigenvalues 
and eigenvectors of symmetric indefinite matrices. 

It is well-known that Jacobi-type algorithms are almost ideally parallelizable. Many 
authors have published, mostly two-sided, parallel implementations of the ordinary algorithm 
(see, for example, [JJ [3J [TS] [2H [35] [37]). Convergence results for different parallel 
orderings in the ordinary Jacobi-type algorithms are given in [T31 [2171 |2"T1 I21)] . 

Besides the inherent element-wise parallelism, the blocking paradigm can also be exploited 
to enhance the performance, especially in one-sided versions of the algorithms. In practice, 
the idea of blocking can be applied with two different goals in mind — to achieve data inde- 
pendency for parallelization, or to reuse large chunks of data. These two opposite goals for 
blocking can be used at different levels of the algorithm to increase its efficiency 

Blocking to reuse data can be used in sequential Jacobi-type algorithms to achieve signifi- 
cant speedups in presence of a two-level memory hierarchy on a single processor (see |17U18| ). 
Quite generally, there are two different approaches to blocking of these algorithms: the block- 
oriented approach (see [UJ), and the full block approach (see [18J). For some well-known 
classes of pivoting strategies, including the block column-cyclic strategy, both approaches 
lead to globally convergent algorithms [TBI HTl [T8] . These blocked versions of the algorithms 
retain essentially the same accuracy as the non-blocked ones |17| . 

In this paper, we present several three-level and two-level parallel implementations of the 
one-sided J-Jacobi algorithm. In three-level implementations, the outer level of blocking is 
used for parallelization of the algorithm, while the inner level of blocking targets the local 
memory hierarchy. Finally, the bottom level of actual computation (in terms of elements of 
matrices) efficiently uses the local cache memory of each processor/core. The inner level of 
blocking may be omitted if the working blocks of matrices are small enough to reside in cache 
throughout the whole computation phase. These simplified two-level parallel implementations 
may be significantly more efficient for smaller matrices. 
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For larger matrices, the inner level of blocking in each algorithm brings additional speedup 
over the two-level parallel algorithm. This speedup depends on the algorithm, the matrix size, 
the number of processes used, and the interprocess communication involved. For example, 
for matrices of order n = 16000 on 16 processor cores, the speedup for all used three-level 
versions of the algorithm is over 40%, and may reach almost 70%. The crossover point in 
efficiency between two-level and three-level implementations depends heavily on a particular 
hardware architecture, and should be determined by numerical testing. 

The global convergence of parallel algorithms can also be proved, if we use the block 
modulus pivot strategy (with slight modifications for inclusion of the diagonal blocks) for the 
outer level of blocking. This follows from the fact that the block modulus pivot strategy is 
weakly equivalent to the block antidiagonal ordering, and this strategy is weakly equivalent 
to the block row-cyclic strategy (see [26J). Finally, block-cyclic strategy (with cyclic by rows 
or columns inner strategy) is convergent (see [17l 118) ). 

The paper is organized as follows. First, we give a detailed description of the one-sided 
J-Jacobi algorithm. Section 3 deals with sequential blocked versions of the algorithm. In 
Sections 4 and 5 we describe the two-level and three-level parallel implementations of blocked 
algorithms, respectively. The final section contains the results of numerical testing. 



2 Two-sided and one-sided Jacobi algorithms 

In this section we shall establish a connection between two-sided and one-sided versions of 
the Jacobi algorithm. We begin by describing the ordinary two-sided Jacobi algorithm, as 
it is easier to understand. Then we derive the one-sided hyperbolic algorithm and describe 
some of its advantages for practical computation. 



2.1 The two-sided Jacobi algorithm 



The ordinary two-sided Jacobi algorithm for diagonalization of a given Hermitian matrix H 
performs a sequence of similarity transformations. Each transformation annihilates a single 
off-diagonal element in the working matrix (see |23l I25| ). The similarity used to annihilate 
the element in position (r, s) consists of a trigonometric rotation Ut(t, s) in the (r, s) plane. 
In the real case, it is equal to the identity matrix, except at the intersections of the r-th and 
s-th rows and columns, where, in Matlab notation, 



U T (r,s)([r,s],[r,s]) = U T :-- 



cos f sin ip 
- sin (p cos if 



(2.1) 



In the complex case, we can use various forms of trigonometric plane rotations Ut- For 
example, 

cos if e la sin f 
-e~ la sva.f cos if 



or 



U T = 



e cos f e sin <p 
— sin p cos if 



or Ut = 



cos if 



sin if 



—e %a sin if e %a cos p 



(2.2) 
(2.3) 



Since Ut {r, s) is a unitary matrix, each similarity transformation is also a unitary congru- 
ence. The whole sequence of transformations is usually divided into sweeps. In each sweep, 
all n(n — l)/2 off-diagonal elements in, say, the upper triangle of the working matrix are 
annihilated exactly once, according to some prescribed traversing or annihiliation order, also 
known as the pivot strategy. The following strategies are in widespread use: 

1. row- and column-cyclic strategies, with the advantage of sequential data access (thus, 
well suited for the inner level of blocking) , 

2. modulus and round-robin strategies, frequently used for parallelization. 
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Quite generally, the two-sided Jacobi process on an initial matrix — H is an iterative 
process of the following form: 

H (S) = [c/ (s) r . . . p^YpWyHU^U^ ■ ■ ■ U^ s \ (2.4) 

where f/W denotes a product of all n(n — l)/2 trigonometric rotations applied in sweep i, for 
i — 1, . . . ,S. The process terminates after, say, S sweeps, when all off-diagonal elements of 

H (S) 

are relatively small compared to dia,g(H^). Then, the diagonal elements of are 
the computed eigenvalues, and U := U^U^ ■ ■ ■ is the computed eigenvector matrix. 

2.2 Factorization for the one-sided algorithm 

The iterative process (I2.4[) can also be organized as a one-sided algorithm. In the first phase, 
the method computes the Hermitian indefinite factorization of the initial matrix H , by a 
variant of the Bunch-Par lett factorization (see [21151 [71 151 M [TU| ) 

PHP T = MDM*. 

Here, P is a permutation matrix, M is a lower triangular matrix, and D = D* is a block- 
diagonal matrix with diagonal blocks of order 1 or 2. An additional diagonalization of the 
diagonal blocks of D, and an appropriate scaling of the columns of M (see |3T] for details) 
yield 

PHP T = GJG*, J = di a g(ii 1) ...,j n „). (2.5) 

If H is nonsingular, then G is lower block-triangular with diagonal blocks of order 1 or 2, 
and J contains the inertia of H, i.e., ju G { — 1,1}, for i = 1, . . . ,n. If H is singular and 
rank(_ff ) = m < n, then G in (12. 5p is a lower block-trapezoidal n x m matrix of full column 
rank. In this case, J is of order m and contains the signs of nonzero eigenvalues of H. 
Note that G*G is always positive definite, which will be sufficient for the one-sided J- Jacobi 
algorithm (see below). 

For a nonsingular matrix H, the ordinary one-sided Jacobi algorithm applies, from the 
right on G* , the sequence of rotations that diagonalizes H (or PHP T ) in (|2.4p . This algorithm 
can be accurate in the relative sense [T2] , but is significantly less efficient than its hyperbolic 
counterpart. 

The hyperbolic or J- Jacobi algorithm for the eigenproblem Hx — \x is obtained by using 
the factored form (|2.5p of H, 

GJG*x = Xx. 
Pre- multiplication by G* and J = J^ 1 give 

G*Gz = AJz, z = JG*x. (2.6) 

If G is of full column rank, then the eigenproblem for H (except for zero eigenvalues) is 
equivalent to the eigenproblem for the pair (G*G, J) from (|2.6p . with positive definite matrix 
A := G*G. Thus, we can apply the Jacobi eigenreduction algorithm for positive definite 
pairs by Veselic [34] • If H is singular, the eigenvectors corresponding to the zero eigenvalue 
can be computed by a post-processing step, described below. 

The matrix G in (|2.6p does not have to be computed by the Bunch-Parlett factorization. 
In fact, it could be any full column rank factor of H, such that (|2.5p holds. This is useful in 
some applications where H is given implicitly by (|2.5p . and G is readily available. 

2.3 The one-sided J-Jacobi algorithm 

In the second phase, the eigenvalues of the pair (A, J) are computed by simultaneous diago- 
nalization of A and J. This is done by a sequence of J- unitary congruences, i.e., matrices V 
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such that V*JV — J. These transformations preserve the structure of J, and only A is actu- 
ally diagonalized in the process. Since A — G*G, this is equivalent to the orthogonalization 
of columns of G, and will be used to compute the eigenvectors of H . 

Simple J-unitary matrices are either trigonometric, or hyperbolic plane rotations. In 
the real case, a hyperbolic plane rotation Uu{r, s) is equal to identity matrix, except at the 
intersections of the r-th and s-th rows and columns, where 



U H (r, s)([r, s],[r, s}) = U H ■= 



cosh ip sinh (p 
sinh ip cosh (p 



(2.7) 



In the complex case, there are several variants of hyperbolic rotations Uh, similar to the 
trigonometric case (|2.2|) - (|2.3|) . The trigonometric rotation Ut(t, s) is J-unitary if j rr = j ss , 
while the hyperbolic rotation Un(r,s) is J-unitary if j rr = —j ss . In Jacobi algorithms, the 
corresponding J-unitary congruence is used to annihilate the (r, s) element in the working 
matrix A. 

The hyperbolic or J- Jacobi algorithm systematically annihilates the elements of the work- 
ing matrix A, until it is diagonalized to a desired precision. As before, we have two-sided 
and one-sided versions of the algorithm. 

Let = G be the initial factor of H from (|2"3]l . and let A^ = A = G*G be the 
corresponding initial positive definite matrix in (|2.6|) . 

To simplify the notation, regardless of the chosen pivoting strategy, let W^ k > be the plane 
rotation used in the fc-th step of the process. If (r, s) denotes the pair of indices determined 
by the pivoting strategy in step k, then is either Ut(t, s), or Uh(t, s). 

In two-sided J-Jacobi algorithms, J-unitary matrices WW are applied on both sides of 
A, and the transformation in step k can be written as 

= [W^YA^^W^, J [k) = [W < - k) ]*J < - k ~ 1 ' ) W < - k) = J. (2.8) 

On the other hand, in one-sided algorithms, they are applied from the right on G, and the 
transformation in step k is 

G (k) =G {k-i) w {k)_ ( 29 ) 

Since, initially, A = G*G and G is of full column rank, it follows immediately that in each 
step A^> = [G^]* G^ k > and G^ is also of full column rank, so A^ remains positive definite 
throughout the process. The annihilation of element ai k 3 1] in (JM1) is equivalent to the 



mutual orthogonalization of two pivot columns gi k g^ k ^ of G^ k ~^ in 

Numerically stable formulae for computing the elements of W^ k ', and the transformed 
elements in the new working matrix A^ k ' or G^, can be found in |34| . 

The algorithm stops after, say, I steps, when the working matrix A"' becomes diagonal 
to the working precision. Then, in the one-sided algorithm, the columns of the final matrix 
G := G« are orthogonal to the working precision. Therefore, G can be written as 

G = UT,, (2.10) 

where U has orthonormal columns, and S is diagonal, containing the norms of the columns 
of G. Both E and U can be computed trivially from the final G, and the columns of U are 
orthonormal to the working precision. Let 

W := W^W^ ■ ■ ■ 

be the product of all used J-unitary matrices that diagonalizes the pair (A, J), and let V = 
W~*. Then, W and V are J-unitary, as well, since J-unitary matrices form a multiplicative 
group. From (f23]) we see that G = GW, and (|2~TU)l yields 

GW' 1 = UYM?- 1 = UHV* = G. (2.11) 
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The last equality is actually the hyperbolic singular value decomposition (HSVD) of G (see 
[4]), computed by the one-sided J-Jacobi algorithm of Veselic [34| . 

The general form of the HSVD of a rectangular n x m matrix G with respect to J is 
described in [36] . When n > m and rank(G JG*) = rank(G) = m, it has the following form 



G = U 



V* 



£ = diag(cri, cr 2 , . . . , <r m ), ci > cr 2 > • • • > cr m > 0, 



where ?7 is unitary of order n, £ is a diagonal matrix of positive hyperbolic singular values, 
and V is J-unitary of order m. If H = GJG*, resembling (|2.5p . then, by using the HSVD of 
G, we obtain 



H = GJG* = U 



V*JV [S 0] E7* = t7 



JE 2 




The squares of the hyperbolic singular values of G are, up to the signs in J, the nonzero 
eigenvalues of H , and U is the corresponding eigenvector matrix. 

If H is singular, of rank m < n, then G and U in (|2.1ip are rectangular n x m ma- 
trices, and [/ contains the eigenvectors that correspond to the nonzero eigenvalues of H. 
Once we compute U , the remaining eigenvectors for the zero eigenvalue span the orthogonal 
complement of the column space of U. If required, they can be computed by the full QR 
factorization of U (see |19|). 

Note that the one-sided J-Jacobi algorithm directly computes the eigenvector matrix U , 
and numerical orthogonality of the computed eigenvectors is guaranteed by the termination 
test for the process (|2.9[) . There are several other advantages of the one-sided algorithm over 
the two-sided. 

(a) Even if the elements of in (|2.8p and (|2.9p are not computed very accurately, this will 
not harm the one-sided process, as long as W 1 -^ is numerically J-unitary to the working 
precision. The one-sided algorithm will then reduce the off-diagonal norm of the pivot 
submatrix 



.4 



(fc-i) 



(fc-i) 

Urs 



.(*-!)' 



9r k - 1) ,gi k - 1) 



instead of diagonalizing it. On the other hand, in the two-sided algorithm, the new off- 
diagonal element ai^} is explicitly set to zero, so the computed elements of should 
be as accurate as possible, 
(b) The two-sided algorithm is much slower than the one-sided algorithm. In each step, the 
one-sided algorithm changes only the two pivot columns of the working matrix G^^ 1 ', 



(fe)l - fo^- 1 ) o^- 1 )] W (k) 



(2.12) 



where W p k ^ is of order 2, and denotes the nontrivial, pivotal part of W^ k \ i.e., W p k ^ is 
either Ut from (|2.ip - (|2.3p . or Ur from (|2.7p . including the complex forms of Uh ■ This 
operation — the so-called update of columns, is easily vectorized by modern compilers. On 
the contrary, the two-sided algorithm changes some parts of two rows and two columns 
in one triangle of A^ k ^ 1%> . Vectorization and other optimizations of such an operation 
still impose great challenges to present-day compilers. 

Henceforward, we shall often omit the superscripts when referring to the parts of G^- k ~ x \ 
indicating that the parts involved refer to the current state of the working matrix G. For 
example, the update of columns (|2 . 12|) will be denoted simply by 



[g' r ,g's\ = [gr,g s ]w ] 



(*■) 



(2.13) 



where primes denote the new (updated) values of columns. 



G 



Finally, numerical testing shows that the J-Jacobi method becomes faster by approxi- 
mately 10%, if J is ordered as J = dia,g(I v , — I n -v), where v £ {0, ...,n} is determined 
by the inertia of H. This ordering can be achieved by using the congruence transformation 
(G*G, J) M> (P^G*GP 1 , P* JP ± ), with a suitable permutation matrix P\. From now on, we 
assume that J is already partitioned in such a way. 

The optimal choice of the pivoting strategy for the non-blocked version of the one-sided 
algorithm depends on how matrices are stored within a particular programming environment. 
Since all our test programs are written in Fortran, where matrices are stored in column-major 
order, we use the column-cyclic strategy for non-blocked algorithms. 

3 Sequential blocked versions of J-Jacobi algorithm 

On hardware architectures with two or more layers of memory with different bandwidths, 
many blocked algorithms show significant speedups. The same can be done by blocking in 
the second, iterative part of one-sided Jacobi algorithms. 

Two blocked modifications of the one-sided J-Jacobi algorithm have been proposed in 
[17L I18| : the block- oriented algorithm and the full block algorithm. The aim of blocking 
in these algorithms is to speed-up the most expensive operation, and that is the update 
of columns (|2.13|) in the working matrix G, by turning it from a BLAS 1 into a BLAS 3 
operation. Therefore, both algorithms use a one-dimensional partition of G into blocks of 
full-length columns. 

3.1 Block-column partition 

A blocked version of the algorithm begins by a block-column partition of the factor G into b 
blocks, and the corresponding partition of the diagonal matrix J 

G = [Gi, . . . ,G b ], J = diag( Jn, . . . , J bb ), (3.1) 

where the block-column Ge has ng columns, and the diagonal square block Ju is of order ng, 
for £= l,...,b. 

For sequential blocking in practice, the number of blocks b is calculated from the require- 
ment that all block-sizes m should be as close as possible to a given target block-size n*, 
without exceeding it. The optimal choice of nt for each algorithm should be determined by 
numerical testing, and will be illustrated at the end of this section. Once we know the value 
of n t , the number of blocks is given by 



Then we divide G into b blocks with almost the same number of columns. There are two 
easy ways to determine the actual block-sizes. Just for simplicity, we shall assume that the 
block-sizes are decreasingly ordered n% > • • • > ni,. 

(a) In a greedy partition, all the blocks, except possibly the last one, have the same maximal 
allowed size nt- The last block may have a smaller number of columns, given by 

rib = n — (b — l)n t = (n — 1) mod nt + 1. 

This is also valid when n t > n, i.e., when b = 1 in (|3.2p . Note that if b > 1, the last 
block can have a significantly smaller number of columns than all the other blocks, for 
m — n b can be as large as n t — 1. 

(b) In a uniform partition, this cannot happen, since we require that m — n b < 1, so the 
numbers of columns in any two blocks may differ by at most one. 



ru-l 



(3.2) 
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Because of that, the uniform partition is more suitable for blocked parallel implementations 
of the algorithms, as it ensures a better load balancing between processes. Consequently, to 
unify the algorithms, it will be used for all types and levels of blocking, from now on. 

For a given number of blocks b, let b r = n mod b, and n m i n = \n/b\. The block-sizes in 
the uniform partition are then given by 




1, £=l,...,b r , 
£ = b r + l,...,b. 



(3.3) 



This follows easily from 

b 

1=1 



D-br 



(b-br). 



Here we can have b r = 0, and then all blocks have the same block-size n m i n . Otherwise, the 
maximal block-size is n m j n + 1. In contrast to the greedy partition, the maximal block-size 
n\ can be strictly smaller than the target block-size n t , even if b > 1. 



3.2 Block-pivot submatrices 

All blocked versions of Jacobi algorithms have a global structure very similar to the non- 
blocked version. The computation proceeds in block-steps that are usually grouped into 
block-sweeps, controlled by some prescribed block-pivoting strategy. This strategy deter- 
mines the blocks that are transformed in each block-step. 

For sequential blocked algorithms we shall use the block column-cyclic pivoting strategy 
at the block level. The same strategy will also be used in three-level parallel algorithms for 
the inner level of blocking, inside each process. 

In a blocked version of the one-sided J- Jacobi algorithm, each block-step performs a 
certain J-unitary transformation of the working matrix G. This transformation has the 
same form as in (|2.9[) . 

G (k) =G (k-i) w (k)^ ( 34 ) 

where k now refers to a block-step, instead of a single step. Various versions differ only in 
the choice of , to be described below. 

To begin with, all matrices G^ in (|3.4p are partitioned in the same way as the initial 
matrix G^ — G, so the partitioning of G^ k ' does not depend on k. Moreover, any column 
partition of G in (|3.ip induces a two-dimensional partition of A = G*G into b 2 blocks 

A i6 :=G*G J5 i,j = l,...,b, 

where Aij is, generally, a rectangular rii x nj matrix. Hence, all matrices A^ = [G^]* G^ 
have the same induced partition as the initial matrix A. 

Typically, in the fc-th block-step of the algorithm, the block-pivoting strategy chooses a 
pair of indices such that 1 < i < j < b. This pair determines two block-columns, 

or the so-called block-pivot submatrix in the current working matrix G^' -1 -', that will be 
transformed in this block-step. Like before, we shall omit the superscript (fc — 1) when 
referring to the individual blocks of G^ 1 ' and A^ k ~ x \ The block-pivot submatrix of G^' -1 ) 
is then defined by 

Gp := [G^Gj], (3-5) 
and it has n^ -1 -* = rii + nj individual columns. 

Ap • — 



Ati A ii. 



= \G 



0- 1)1*^0-1) 

y 1 p 



G*G 



G*G i 



(3.6) 



<s 



In both algorithms (full block and block-oriented) , in addition to the two block-column pivot 
submatrix Gp from (|3.5p . the single block-column pivot Gp ■= [Gj] is employed. In 
this case, the corresponding square block-pivot submatrix of ^4( fe_1 ) is 

Sip — P J ^ P — ^- r i^- J i~ 

Block-partitions of G^' -1 - 1 and are illustrated in Figure [1] Both block-pivot sub- 

matrices are shown in color. 

A (k-i) gO" 1 ) 



Gi Gj 



Figure 1: Block-columns [Gi,Gj] in G^ k ^ generate a square block Ap ^ in A^ k x ' = 
[G (fc ~ 1) ]*G (fc_1) . 

Note that Gp ^ is always of full column rank, so Ap ^ is certainly positive definite. 
Also, Gp 1 ' is one of many possible "Cholesky-like" factors of Ap . To speed up the 
process, instead of "tall" factor Gp" 1 -*, the Cholesky factor i?p l ' of Ap" 1 -* will be used. 

Transformations applied to Rp ^ will be accumulated and applied to Gp 1 in BLAS 3 
fashion by block matrix multiplication. 

The only difference between the full block and block-oriented algorithms lies in the trans- 
formation of block-pivot submatrices Ap from (|3.6p . 

3.3 Full block algorithm 

The main idea behind the full block algorithm is to work with block-columns in a similar 
fashion as the non-blocked one-sided algorithm works with individual columns. 

In the non-blocked algorithm, the pivot strategy chooses a pair of columns that generates 
a matrix pair (A' fc_1 \ of order 2 in (13. 6p . which will be diagonalized in step k. This 

simply means that the off-diagonal element A^ will be annihilated in that step. 

In the full block algorithm, the so-called outer, or block-pivot strategy chooses a pair 
of block-columns for transformation in this step. Likewise, they generate a matrix pair 
{A^ k ~ x \ jC 0-1 )) in (|3.6p , but the order of this pair is, generally, greater than 2. 

The non-blocked algorithm can now be mimicked in two different ways. 

1. We can annihilate only the off-diagonal block Aij, i.e., block-diagonalize the pivot sub- 
matrix A^" 1 ). In the end, this strategy produces a block-diagonal matrix. The final step 
consists of the diagonalization of these diagonal blocks. Unfortunately, it can be shown 
(see [18]) that this approach is only linearly convergent to the block-diagonal form. 

2. Therefore, in each step of the algorithm, the whole pivot submatrix A^ k ~ x ^ is diagonalized. 
Even though this strategy seems less efficient at first sight, it is quadratically convergent. 
If the block sizes are chosen to be sufficiently small, so that most of the required data 
resides in cache, the matrix A^ k ~ x ^ can be efficiently diagonalized by the non-blocked 
one-sided J-Jacobi algorithm. 
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In this approach, after the first full block-sweep, the diagonal blocks of the working matrix 
A remain diagonal throughout the whole process. Thus, it pays off to do a preprocessing 
step, before the block-sweep loop, in which all diagonal blocks are diagonalized. Afterwards, 
in each step we have to diagonalize a pivot submatrix of the following form 



A: 



(3.7) 



where A,,- and A,-,- are diagonal matrices. 

The one-sided J-Jacobi algorithm works on a factor of A^-V. This process will be the 
fastest possible, if it works on a square factor, not on a "tall" one. A suitable square factor 
i2( fe_1 ) can be computed from (|3.6|) in two ways: 

1. by the QR factorization of [Gi, Gj], which is a "tall" factor of A^ k ~ x \ or 

2. by the Cholesky factorization of A^ k ~ l > . 

Due to the special structure (|3.7[) of A^^ 1 ^, the second choice turns out to be much faster, 
as it requires only one matrix multiplication to form the off-diagonal block Ay . 

Moreover, the structure of A( fc_1 ) in (13. 7[) can also be exploited to efficiently compute the 
Cholesky factorization 



A {k-1) = [ fl (fc-l)]» i2 (*-l) j R(k-1) 



(3.8) 



Since the diagonal blocks of A^ k x > in (|3.7p are already diagonal, it is easy to see that 



(Ri; 



Ai 



Rjj Rjj ^jj Rij Rij ' 



(3.9) 
(3.10) 
(3.11) 



The final block Rjj can be computed by the Cholesky factorization of the right-hand side in 
(|3.1ip . An alternative is obtained by writing 



A; 



[Rt 



lii , Ijj ) 



GJG* 



and we can use the hyperbolic QR factorization of G* according to J to compute Rjj (see [28j 
for details). Both factorizations are accurate [25], but once again, the Cholesky factorization 
is faster. 

The diagonalization of A* -1 ) is now equivalent to the orthogonalization of columns of 
the Cholesky factor R^ k ~^ from (|3.8p . This is done by using the "in-cache" non-blocked 
one-sided hyperbolic Jacobi method to compute the HSVD of the upper triangular matrix 
R^ 1 ), and yields 

R (k-i) w (k) =c/ (fe) E (fc) ) ( 312 ) 

or 

ft(k-l) _ {j(fe)j](fe)ryWi* j y{k) _ \yy( k )Y* . 
From (|3"1^ and (|3~T2j) we obtain 

[W {k) ]*A^ ls> W {k) = [W^YIR^^]* R^-^W^ = [E (fe) ] 2 , 

so, is the matrix which simultaneously diagonalizes both A^ -1 - 1 and J^ k ~ x \ Note that 

can be computed in several different ways (see [T8]), but numerical tests show that the 
explicit accumulation of all used J-rotations is the best option, both regarding the speed, 
and the accuracy. A careful analysis shows that during the first few sweeps of the block 
method, the accumulation of transformations (on a micro-level) is relatively slow, but in the 
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last few sweeps, when the number of rotations applied becomes progressively smaller, the 
accumulation becomes faster and faster. 

The final and most time-consuming part of the block algorithm is the so-called update 
(|2.9p of block-columns in G^ 1 ). Post-multiplication of [d,Gj] by is done by the 

BLAS 3 routine xGEMM. However, since xGEMM cannot overwrite the original factor matrix 
[Gi,Gj], we use an additional n x tiq array as a workspace, where uq — 2maxj = i j ... iJ , rij. 
To avoid unnecessary copying of the pivot blocks, updated columns are not moved back to 
their original positions in the matrix. We just keep track of their current position, and in a 
post-processing step, all block-columns are re-permuted back to their original positions. 

Since two- and three-level algorithms are based on both sequential non-blocked and 
blocked algorithms, we will present them here for the sake of completeness. In the first 
auxilliary subroutine, Jacobi_Cycle (see Algorithm 13. ip cither off-diagonal elements of A, 
or the off-diagonal block Ay of A, are annihilated. The routine is a slightly modified version 
of the inner loop (for Hermitian matrices instead of symmetric) of Algorithm 1 from [32J . 



Algorithm 3.1: Single sweep of the hyperbolic Jacobi algorithm 
Jacobi_Cycle(G, J, D, W, m, m, rij, diag_bl); 

Description: If diag_ bl — true then G consists of a single block column, i.e., 

G — d G c mxn i anc i j — Else, G consists of two block columns, 
G = [Gi,Gj] 6 C mx(n ' +n ^ and J = diag(J II , J j3 ). The diagonal of A = G*G 
is stored in vector D, i.e., D = diag(A). Transformations applied to G are 
accumulated in matrix W. 

begin 

if diag_ bl then //parameters for triangular part of A 

| start _s — 2; final _s — m; 
else // parameters for rectangular block Aij = G*Gj 

| start _ s — m + 1 ; final _ s — m + ny, 
end if 

for s — start_ s to final_ s do 

if diag_ bl then final_ r=s — l else final_ r = Hi + nf, 
for r — 1 to final_ r do 

«r S = g*g s ; hyp = -J rr ■ J ss - 

// compute parameters for modified rotation defined by Q2 . 3D 

T] = sign(rea rs ) • |a rs |; e' a = a rs /n; //if A is real, 77 = a rs 
j? = h yp-(a ss -h yp-a rr )/(2 ■ rj); t = siga(0)/(|0| + ^ + hyp); 
h — \/l + hyp ■ t 2 ; cs — 1/h; sn — cs ■ t; 

II update diagonal of D 
d rr = d rr + hyp ■ t ■ rj; d ss — d ss + t ■ n; 

II update columns r and s of G and W 
f = e' a ■ 9r\ g r = cs- f -hyp- sn ■ g s ; g r = sn-f+ cs ■ g s ; 
f — e' a ■ w r ; w r = cs • / — hyp ■ sn ■ w s ; w r — sn ■ f + cs ■ w s ; 
end for 
end for 
end 



If the Jacobi_Cycle algorithm is repeatedly applied to the off-diagonal elements of 
A until convergence, the obtained algorithm is the one-sided hyperbolic algorithm (called 
Jacobi_Diagonalization in Algorithm 13. 2p . To speed up the calculation (see |32) for de- 
tails) the routine also computes the diagonal of A. Finally, in Algorithm 13.31 we present the 
sequential full block algorithm. 



11 



Algorithm 3.2: One-sided hyperbolic Jacobi diagonalization algorithm 
Jacobi_Diagonalization(G, J, D, W, m, n, max^it); 

Description: Diagonalization of pair (A, J), where A — G*G, G £ c mxn by the one-sided 
hyperbolic Jacobi algorithm, until either convergence, or maximal number of 
iterations max_it are reached. Vector D keeps the diagonal of A. 
Transformations applied to G are accumulated in matrix W . 

begin 

Uer=0; W = I n ; 
repeat 

iter = iter + 1; 

for i = 1 to n do da = <?*<?;; // in each cycle initialize the diagonal of D 
Jacobi_Cycle(G, J, D, W, m, n, 0, true); 
until (convergence) or (iter> max_it); 
end 



Algorithm 3.3: Sequential full block algorithm 
Full_Block(G, J, D, W, m, n, rnax_tt); 

Description: Diagonalization of pair (A = G*G, J), G £ c mxn by the full block Jacobi 
algorithm. 

begin 

partition G into almost equally-sized column blocks Gi, i = 1, . . . , 6; 
corresponding partition of J is J = diag(Jn, . . . , Jbb); 
W = I„; iter=0; 
repeat 

iter = iter + 1; 

for i = 1 to b do // diagonal block pre-processing 

Ap = GiGi, 

Cholesky factorization of Ap = R^R^; 

Jacobi_Diagonalization(i?ii , Ju, Da, Wu, rii, Hi, max_it); 
G t =Gr- Wu; Wi = Wi- Wu 
end for 

for j = 2 to b do // diagonalization of pivot block Ap 
for i = 1 to j — 1 do 

Ap = [d Gj]*[Gi Gj]; Jp = diag(Jii, Jjj); np = m+nj; 
Cholesky factorization of Ap = R* R; 

Jacobi_Diagonalization(_R, Jp, Dp, Wp, np, np, max_it); 
[Gi Gj] = [d Gj] ■ W P ; [Wi Wj] = [W; Wj] ■ W P 
end for 
end for 

until (convergence) or (iter> max it); 
end 



3.4 Block-oriented algorithm 

The block-oriented algorithm is a simple rearrangement of each sweep of the one-sided J- 
Jacobi algorithm in a cache-aware manner. 

Typically, in each sweep, the off-diagonal elements of diagonal blocks An in (13. 6p are 
annihilated first. Then, for each pivot submatrix A^^ 1 ^, all elements in the off-diagonal 
block Aij are annihilated once, thus completing a sweep. 

As in the full block case, the algorithm works on a factor of A^ fc_1 \ and again it is 
advantageous to have a square factor R^ k ~ x \ Since A^ 1 ^ has no special structure in the 
block-oriented algorithm, the factor is computed by using the standard Cholesky 

factorization. 
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The accumulation of and the update of block-columns in G^ k 1 - ) is done in exactly 
the same way as in the full block algorithm. 



Algorithm 3.4: Sequential block oriented-algorithm 
Block_Oriented(G, J, D, W, m, n, maxit); 

Description: Diagonalization of pair (A = G*G , J), G G c mxn by the block-oriented Jacobi 
algorithm. 

Assumption: If this function is used as a part of a parallel three-level block-oriented 

algorithm, partition of G should respect column boundaries of Gi_ uk and 
Gj_ bik- 

begin 

partition G into almost equally-sized column blocks Gi, i — 1, . . . ,b; 
corresponding partition of J is J = diag(Jn, . . . , Jbb); 
W = I n ; Uer=0; 
repeat 

iter = iter + 1; 

for i = 1 to 6 do // single pass through diagonal blocks 

W P =I n% ; A P = G*G,; 
Cholesky factorization of Ap = RtfR^; 
Jacobi_Diagonalization(i?ii , Ju, Da, Wu, rii, rii, 1); 
G l =G l - W u ; Wi = Wi- Wu 
end for 

for j = 2 to 6 do 

for i = 1 to j — 1 do // single pass through off -diagonal blocks 

W P = I ni+nj \ Jp = diag(Jii, Jjj); A P = [d Gj]*{G t Gj]; 
Cholesky factorization of Ap = R* R; 
Jacobi_Cycle(i?, Jp, Dp, Wp, m, Hi, rij, false); 
[Gi G 3 ] = [d G 3 ] ■ W P ; [W t Wj] = [Wi W 3 ] ■ W P 
end for 
end for 

until (convergence) or (iter> max_it); 
end 



3.5 Sequential testing 

The first tests were made on a single core processor, (Pentium 660J, 3.6GHz, running Win- 
dows 64-bit), in double precision. We were also using Intel Fortran compiler and the BLAS 
and LAPACK routines from the Intel Math Kernel Library. 

For block algorithms, a provably convergent strategy (see |17l \TE\): the block column- 
cyclic at the block level, and the ordinary column-cyclic inside each block is used. The same 
strategy is used in parallel three-level algorithms inside each process. 

The eigenvalues of symmetric test matrices are randomly generated in the given range, by 
using the LAPACK routine xLARND. The matrices are then generated by the routine xLAGSY 
for generation of symmetric matrices, based on random reflectors. Test matrices have various 
orders, from 500 to 4000 in steps of 500, and the block-columns Gi consist of 8-128 individual 
columns. 

Our tests show that the non-blocked Jacobi algorithm runs faster than its block coun- 
terparts for matrices of order less than 1000. For small orders, a big portion of the matrix 
resides in the cache memory, while book-keeping by the block algorithms takes too much 
time. For matrices of order 1000 < n < 2500, the block-oriented algorithm is the fastest one. 
Finally, for matrices of order n > 2500, the fastest algorithm is the full block algorithm. 

The speedups of the full block and the block-oriented algorithms, compared to the non- 
blocked algorithm, for n = 4000 are displayed in Figure [5] 
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Figure 2: Speedup of sequential blocked algorithms for matrix size n = 4000, compared to 
non-blocked algorithm, single core Pentium 660J processor. 



For matrices of order n between 1500 and 3500, the speedups are somewhat smaller, but 
have the same shape as in Figure [5J 

The second test was made on a single core of Intel Xeon E5420 processor (single core 
of test machine A, see later). We were quite surprised that the block-oriented algorithm is 
faster than the full block algorithm for all tested matrix sizes (see Figure [2]) . 
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Figure 3: Speedup of sequential blocked algorithms for matrix size n — 4000, compared to 
non-blocked algorithm, one core of Xeon processor. 

As neither of the sequential blocking strategies is clearly better than the other, we shall 
continue to use both of them in parallel algorithms, as well. 

4 Parallel two-level algorithms 

The idea of blocking can also be used to parallelize the Jacobi algorithm, but with an entirely 
different goal, to ensure data independency. This will be done simply by changing the block- 
pivot strategy. Except for the communication between processes, everything else is inherited 
from sequential block algorithms, so the computational procedure is essentially the same. 

The basic idea of parallelization is to partition G and J as in (|3.ip , and map this partition 
to processes. From now on, p will denote the number of available processes. Then, the 
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reasonable number of blocks in partition (|3.1I) is 2p. One can use 2p — 1 blocks, as shown in 
Figure 21 but in this case, one process has less work to do than the others. 
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Figure 4: Independent blocks for p = 3, for even (n = 6) and odd (n — 5) number of blocks. 
If n = 5, the workload of process 2 is approximately half of the workload of processes or 1. 



As we have already said, there are two standard parallel pivoting strategies that can be 
used as block-pivoting strategies in parallel Jacobi algorithms. 

1. The modulus pivot strategy is known to be convergent, but it uses one parallel step more 
per sweep than is minimally required. 

2. The round- robin strategy is appealing because it uses the minimal number of parallel 
steps per sweep. Unfortunately, nothing is known about its convergence properties. 

We decided to implement both strategies and compare them in practice. 

The ring or one-dimensional torus is a natural topology of the parallel processes in our 
algorithm. To minimize the communication in each parallel step, only one block-column of the 
current working matrix G is sent to one neighboring process, and the other block-column is 
received from the other neighboring process. This simple and uniform communication pattern 
is in fact a simultaneous one-step cyclic shift in the ring of processes. The network bandwidth 
is, therefore, fully utilized. The amount of transmitted data remains (almost) constant in the 
whole run, thus making the communication overhead predictable and scalable, if the network 
speed increases. 

Due to blocking, memory access in the communication routines is contiguous, and the 
relative impact of the network latency effectively diminishes as the problem size increases. 

Our first goal is to map the block-columns [Gj, Gj] and the corresponding diagonal blocks 
of J to individual processes in such a way to ensure data independency between processes 
and a natural communication pattern between them. 

The mapping to processes is inherited from the modulus strategy, with a slight shift with 
respect to the original modulus strategy presented in |21) . so that our modulus strategy 
begins on the main antidiagonal of the matrix. This is the initial layout for both parallel 
pivoting strategies. More precisely, the initial order of block-columns of G with respect to 
processes is (1, 2p), (2, 2p — 1), . . . , (p,p+l), i.e., block-columns (q + 1, 2p — q) reside in process 
q, for q = 0,...,p- 1. 

In the MPI terminology, q is usually called a rank of a process, and p is the total number 
of processes (denoted by nproc in all code segments). Each process, by knowing only its own 
rank and the number of blocks nbl := 2 ■ nproc, independently computes indices i_ blk, and 
j_ blk of its first pivot pair in a sweep. 



4.1 Modulus pivoting strategy 

Each block-sweep consists of a certain number of parallel steps according to the chosen block- 
pivot strategy. A single parallel step is a collection of all non-overlapping transformations 
executed in parallel (in all processes). Figure [S] shows the block layouts for the modulus pivot 
strategy in odd and even sweeps. 

In each sweep of the modulus strategy, the blocks denoted by □ are visited twice, and we 
have two possibilities: 
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Algorithm 4.1: Initialization of modulus and modified round-robin strategy 



Initialize (rank, nbl, ip, jp, i_blk, j_blk); 

Description: At the start of a sweep, the algorithm for process rank computes indices of the 
initial column blocks (i_blk,j_blk). The auxilliary pair (ip,jp) is used for 
determination of the pivot indices in the next steps. 

begin 

| ip = rank +1; i_ blk — ip; jp — nbl — rank; j_ blk = jp; 
end 



2 o 



□ 2 
1 o 



Odd sweep. 



o o 



o o □ ° 
■ o 1 □ o 

• o 2 



Even sweep. 



Figure 5: Modulus strategy for p — 3 processes and 2p — 6 blocks. A number denotes 
the index of a process where the block (two block-columns) resides, . denotes still unvisited 
blocks, o denotes already processed blocks, while □ denotes the blocks that will be visited 
twice. 



(a) transform the block twice, as in the so-called quasi-cyclic strategies, 

(b) just do nothing, and wait for the next pair of blocks. 

• o o □ o o 
■ o o □ o 

• O O □ 

• o 



Figure 6: For n = 3 and p — 6, the blocks visited twice are denoted by □. 

In our implementation, these blocks are transformed twice. Note that doubly transformed 
blocks are always at the same place, occupying the n-th superdiagonal. Figure H2 shows the 
blocks that are transformed twice in the first sweep. 

Subroutine MMStep in Algorithm 14.21 calculates the next pivot indices for process rank 
according to the modified modulus strategy. By using the sweep-number parity, nsweep, it 
determines the next process-sender rcv_ rnk, which sends its block-column rcv_ blk to rank, 
and the next process-receiver snd_ rnk, which receives block-column snd_ blk from rank. 

4.2 Round-robin pivoting strategy 

The standard round-robin tournament strategy has to be modified to ensure that each process 
sends and receives only one block-column in each parallel step. The strategy is best described 
by grouping the indices of the block-columns in two horizontal groups, where the indices above 
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Algorithm 4.2: A step of a modified modulus strategy 



MMStep (rank, nbl, nproc, nsweep, ip, jp, i_blk, j_blk, snd_rnk, snd_blk, rcv_rnk, rcv_blk); 
Description: If a step is not the first one in a sweep, process with rank rank determines the 
indices of the next pivot block (i_ blk,j_ blk). 

begin 

if (ip + jp) > nbl then 

snd_ blk = i_ blk; ip = ip + 1 ; 
if ip = jp then 

| ip = ip — nbl/2; jp = ip; 
end if 

%_ blk = ip; rcv_ blk = i_ blk, 
else 

| snd_blk = j_blk, jp = jp+l; j_blk = jp; rcv_blk — j_blk, 
end if 

if (nsweep mod 2) > then 

| snd_ rnk — (nproc + rank — 1) mod nproc; rcv_ rnk = (nproc + rank + 1) mod nproc; 
else 

| snd_ rnk — (nproc + rank + 1) mod nproc; rcv_ rnk = (nproc + rank — 1) mod nproc; 
end if 
end 



each other belong to the columns in the same process. 

In the standard strategy, the indices are rotated counter-clockwise by one to obtain the 
next layout. In the modified strategy, the indices are rotated in the clockwise direction by 
[(2p — 1)/2J. This process is illustrated in Figure [7] 

123 451 234 512 345 

O^O^O O^O^O O^O^O O^O^O O^O^O 

\ i \ f \ I \ * \ ♦ 

o o— o o o— o o o— o o o— o o o— o 

654 632 615 643 621 



Figure 7: Modified round-robin strategy for p = 3. 

It is worth to mention that our ordering is "preferable" to Brent and Luk's ordering [S] 
since it is not working on elements reduced in the previous step (see [141 page 455] for details), 
at least for small number of processes p. Unfortunately, the proof whether or not is modified 
round-robin strategy convergent seems to be uneasy. 

The block layouts in odd and even sweeps for the modified round-robin strategy are shown 
in Figure [51 

Subroutine RLStep in Algorithm 14.31 calculates the next indices of blocks (i_blk,j_blk) 
which will reside in process rank, according to the modified round-robin strategy. All used 
variables have the same meaning as in subroutine MMStep. The only difference is variable 
swflag. The value swflag — 1 means that the blocks in a process rank are placed in reverted 
positions, i.e., i_ blk > j_ blk, and should be renumerated such that i_ blk < j_ blk. 

It should be said that, despite its optimality with respect to the number of parallel steps, 
in most of our tests, the round-robin turned out to be slower than the modulus strategy. 

4.3 Two-level algorithms 

In a given parallel step, each process has is own local pair of block-columns := [Gi,G.j\, 
where q is the index of the process, that generates a "local" block-pivot matrix. To simplify 
the notation, this block-pivot matrix will be denoted by A^ q \ and appropriate J by J^. So, 
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Figure 8: Modified round-robin strategy for p = 3 processes and 2p = 6 blocks. A number 
denotes the index of a process where the block (two block- columns) resides, . denotes still 
unvisited blocks, and o denotes already processed blocks. 



Algorithm 4.3: A step of the modified round-robin strategy 

RLStep (rank, nbl, nproc, nsweep, ip, jp, i_blk, j_blk, snd_rnk, snd_blk, rcv^rnk, rcv_blk, 
swflag); 

Description: If the step is not the first one in the sweep, process with rank rank determines 
the indices of the next pivot block (i_blk,j_blk). 

begin 

swflag — 0; 

if (ip + jp) > nbl then 
if jp < nbl then 

snd_ blk = i_ blk; ip — ip + 1 ; 
if ip < jp then 

| i_ blk = ip; rcv_ blk — i_ blk; 
else 

ip = jp; i_ blk — j_ blk, swflag = 1; jp = nbl; j_ blk — jp; 
rcv_ blk — j_ blk; 
end if 
else 

if ip > (nbl/2) then 

| snd_ blk = i_ blk, ip — ip — nbl/2 + 1; i_blk= ip; rcv_ blk — i_ blk; 
else 

| snd_ blk = j_ blk; jp — ip + 1 ; j_ blk = jp; rcv_ blk = j_ blk; 
end if 
end if 
else 

| snd_blk = j_blk; jp = jp+l; ]_blk = jp; rcv_blk — j_blk; 
end if 

if (nsweep mod 2) > then 

| snd_ rnk — (nproc + rank — I) mod nproc; rcv_ rnk = (nproc + rank + I) mod nproc; 
else 

| snd_ rnk — (nproc + rank + I) mod nproc; rcv_ rnk = (nproc + rank — I) mod nproc; 
end if 
end 



as in (|3.6p . we have 
AM = 



4- 

A 



j j 



[ G M]* G M = [G l ,G J ]*[G l ,G ] 



G*G i 



G*G 3 
G j G i 



(4.1) 



In two-level algorithms, the first step is different than the others. 
1. The matrix pair (A^,M), for each process q is diagonalized by the J^-unitary trans- 
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formations, but the special formulae (|3.9[) - (I3.11I) cannot be used for the Cholesky fac- 
torization, since the diagonal blocks of are not yet diagonal. 
2. In the block-oriented case the full sweep of J^-unitary transformations is applied to the 
off-diagonal elements of A™ . 
The other steps in a sweep consist of the following computational tasks in each process q: 

1. local matrix pair (A^ q \ J' 9 ') is 

(a) diagonalized in the full block case, but note that now A^ has diagonal blocks already 
diagonalized, and (I3.9l) - (|3.1ip can be used, 

(b) in the block-oriented case, only elements of block A i ■ are annihilated. 

In both cases the local transformation matrix is accumulated throughout this pro- 
cess, and 

2. a local pair of block-columns is updated by W^ q \ i.e., [G^]' := [G[, G' } ] = G^W^. 
This is followed by a communication step to interchange block-columns between processes. 

By using previous two algorithms, and the communication, we can easily write both 
two-level algorithms (see Algorithm 15. 2p . 

5 Parallel three-level algorithms 

If we compare the optimal block sizes (which depend on the order n of a matrix) for the 
sequential full block and block-oriented algorithms in Figures [5] and [31 we see that the optimal 
sizes are somewhere between 8 and 48 for the full block algorithm, and between 24 and 128 
for the block-oriented algorithm. 

This suggests that, as soon as the block size in each process becomes sufficiently large, 
it is advisable to use a sequential blocked algorithm for local diagonalization, instead of 
its non-blocked counterpart. This motivates us to construct the "doubly" blocked or three- 
level versions of parallel J- Jacobi algorithms and compare their efficiency to the non-blocked 
(two-level) parallel algorithms. 

5.1 Three-level full block algorithm 

The idea behind the three-level full block algorithm is to use the sequential blocked algorithm 
with optimal sequential block size inside each process. Note that inside each process we have 
two block-columns, each with n rows and approximately columns, with rib *C n. 

The first step of the algorithm is to "shorten" these two "tall" local columns [Gi,Gj]. 
These columns are multiplied to obtain the local block-pivot matrix A^ as in (|4.1[l . and the 
Cholesky factorization of A^ gives a significantly smaller factor of A^ — the order of 
is approximately 2nb- The sequential blocked one-sided Jacobi algorithm is then used 
to compute the HSVD of i?M with respect to jM . 

In the final step, the accumulated transformation matrix that diagonalizes the local 
pair {A^ q \ ) is applied to update the block-columns [Gi ,Gj}. Note that if the second level 
(in cache, or inner) method is chosen to be the full block method, it is not necessary to take 
care whether the inner block partition respects the boundary between two "outer" blocks 
[Gi, Gj], or not. 

5.2 Three-level block-oriented algorithm 

The basic idea is almost the same as in the three-level full block algorithm, but we have to 
take care that the inner block partition respects the boundary between the two "outer" local 
blocks [Gi,Gj]. 

Here, the distinction has to be made between two different types of sub-blocks in the local 
block- pivot matrix A^ from (|4.ip . 
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1. The diagonal sub-blocks An pose no problem, as they are diagonalized only once in each 
block-sweep, and the corresponding block-column Gi can be locally partitioned in the 
usual manner. 

2. To deal with, generally, rectangular off-diagonal block Ay = G*Gj, each of the local 
blocks Gi and Gj have to be partitioned individually, as in (|3.ip . This correctly defines 
an "inner" partition of the block i?y in the Cholesky factorization of A\ q \ and this block 
is used by the sequential block-oriented one-sided J-Jacobi algorithm. This partitioning 
procedure is given in more detail in Algorithm 15. 11 



Algorithm 5.1: Off-diagonal step of parallel three-level block-oriented algorithm 
Off_Diagonal(G, J, D, W , m, n r , n s ) Description: Single annihilation pass in two-level 

blocked manner of the off-diagonal block of A rs — G*G S of A = G*G, where 

G = [G r G a ] by the block-oriented Jacobi algorithm. 
Assumption: Block Gi has m columns. 

begin 

partition G r into almost equally-sized column blocks Gi, i = 1, . . . , b r ; 
partition G„ into almost equally-sized column blocks Gi, i = b r + 1, . . . , b; 
b = b r + b s ; n = n r + n s ; W = I n \ 
corresponding partition of J is J = diag(Jn, . . . , Jbt); 
for j = b r + 1 to b do // single pass through A rs 
for i = 1 to b r do 

W P = I ni+nj ; A P = [Gi Gj]*[Gi Gj]; J P = diag(J«, Jjj); 
Cholesky factorization of Ap = R* R; 
Jacobi_Cycle(_R, Jp, Dp, Wp, m, Hi, rij , false); 
[Gi G 3 ] = [Gi G 3 ] ■ W P ; [Wi Wj] = [W t W 3 ] ■ W P 
end for 
end for 
end 



5.3 The core of the parallel algorithms 

The main similarities and differences of the two-level and three-level parallel algorithms are 
summarized at the top level by the following description. 

The factor G is distributed evenly to p processes as 2p block-columns of n rows, and 
approximately ^ columns each, where the number of columns varies by at most one between 
all block-columns. A pair (q + 1 , 2p — q) of block-columns is assigned to a process q, for 
< q < p. Processes are organized into a one-dimensional torus, where each process has 
exactly two neighbors to communicate with. In our implementation, each process is bound 
to a single processor core. 

As a first step in each cycle, the full block (F) strategy diagonalizes diagonal blocks in 
parallel to obtain diagonal blocks in diagonal form (|3.7p for the rest of the sweep. Note that 
this could be done only in the first sweep, but we prefer occasional re-diagonalization for the 
sake of the accuracy of the special Cholesky factorization (|3.9|) — (|3.11j) . In the block- oriented 
(B) algorithms, in the first step of each sweep the off-diagonal elements of the whole A^ 
are annihilated, while in the rest of the sweep, only elements of the off-diagonal block Aij in 
(|4.ip are annihilated. 

The algorithm works in sweeps, until convergence is detected. Convergence test is twofold: 
either no rotations in the previous sweep were applied, or rotation angles were all below the 
predefined quadratic convergence threshold. 

Each sweep makes 2p (modulus) or 2p— 1 (round- robin) steps. In each step p independent 
pairs of block-columns [Gi,Gj] are transformed in parallel. The choice of pairs is directed 
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by the parallel pivot strategy. Each pair of block-columns is acted upon at least once in a 
sweep. 

The anatomy of a step for the process q follows: 

1. A "small" matrix A^' — [Gi,Gj]*[Gi,Gj] is formed from the two block-columns, Gi and 
Gj, local to process q. 

2. The Cholesky factorization is performed to obtain = [R^]*R^ q \ with square. 
In the full block strategy, is computed by using (|3.9D - (|3.11() . 

3. The columns of R$ are transformed by the hyperbolic one-sided Jacobi algorithm: 

(a) In the full block strategy, the hyperbolic singular- value decomposition of i?^ is com- 
puted, such that R^W^ = U [q] T, [q] . i.e. columns of are orthogonalized. 

(b) In the block- oriented (B) strategy, the off-diagonal elements of A^ are annihilated 
exactly once by using congruences. Instead of applying transformations to both sides 
of A^\ i.e., [A^Y = [W [q] ]* A^W [q] , they are applied to the right-hand side of Rh\ 
i.e., = R^W^. 

All the transformations applied are accumulated in the matrix in both cases. If 
R^ is large enough, the orthogonalization (annihilation) is performed by the blocked, 
otherwise by the non-blocked one-sided hyperbolic Jacobi algorithm. This is the only 
difference between three-level (3B, 3F) and two-level (2B, 2F) algorithms. 

4. The accumulated transformations are applied to the right-hand side of [Gi, Gj], i.e., 

5. According to the parallel pivot strategy, one block-column is chosen to be sent to the one 
of the neighboring processes, and is simultaneously replaced by another block-column 
received from the other neighbor. 

This description, written in pseudo-code as Algorithm 15.21 shows how simple the effi- 
cient parallel Jacobi algorithm can be built from the essential routines: the non-blocked and 
blocked sequential algorithms together with simple communication routines. 

6 Numerical testing 

Our parallel Jacobi algorithms are implemented in Fortran, using Intel MKL for sequential 
BLAS and LAPACK routines, HDF5 libraries for storing data, and synchronous point-to- 
point and collective communication routines of MPI-1 from Open MPI distribution. There 
are no inherent constraints in the code on hardware platform, cluster configuration or matrix 
dimensions, save physical and method-induced ones. Test processes are 64-bit and single- 
threaded. Each process is bound to its own processor core and communicates with others 
through MPI stack only. 

6.1 The test machines 

Test machine A is a five-blade cluster. Each blade has two Intel Xeon E5420 processors, 16 
GB DDR2 RAM, and a gigabit Ethernet network card connected to a switch. The processors 
run at 2.5 GHz, with 1333 MHz FSB, 2x6 MB level-2 cache memory, and SSE 4.1 instruction 
set available. Each processor has four cores with independent 32 + 32 kB level-1 caches. The 
cores of a processor share level-2 cache in pairs. Cache memory at both levels runs at the 
core speed. 

For the initial testing two other parallel machines, B and G, were used. Machine B is 
a single SMP server with NUMA architecture, Sun Fire X4600 M2, with 8 dual-core AMD 
Opteron 8220 CPUs (core speed 2.8 GHz, 1 MB level-2 cache per core), and 48 GB DDR2/667 
ECC RAM. Machine G is a cluster of 16 blades, each blade consisting of two dual-core Intel 
Xeon 5150 CPUs (core speed 2.66 GHz, 32 + 32 kB level-1 cache per core, and 4 MB level-2 
cache shared between the cores), 8 GB DDR2 RAM, and a gigabit Ethernet network card 
connected to a switch. 
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Algorithm 5.2: Parallel block Jacobi algorithms 



Parallel_Jacobi {G, J, m, n); 

Description: Diagonalization of a Hermitian matrix H, given in factored form H — GJG* , 

G € C mxn via the one-sided hyperbolic SVD of G. 
Assumption: G is divided into nbl block-columns, and J into nbl diagonal blocks. In each 

process the first block-column is denoted by index i_ blk, and the second by 

j_ blk. 

begin 

Initialize (rank, nbl, ip, jp, i_blk, j_blk); 

receive/read appropriate G [q] = [G 1 _ b ik G 3 _ b ik], J [q] = dia.g( Ji_ b ik,i_bik, Jj_bik,j_bik); 
repeat 

first_ step = true; n M = n,_ b lk + n 3 _ blk ; 

//compute square factor of matrix A^; 

A li] = [gM]*[G m ]; 

Cholesky factorization of A lq] = [R [ql ]* R [q] ; 

// call apropriate sequential algorithm for each block; 
switch algorithm of 
case 2F: 

j Jacobi_Diagonalization(i? M , J lq] , D lq] , W lq] , n lq] , n lq] , maxit); 
endcase 
case 2B: 

if first_step then 

Jacobi_Diagonalization(i? M , J lq] , D lq] , W lq] , n lq] , n lq] , 1); 
first_ step — false; 
else 

i Jacobi_Cycle (RM, JW, DM, VK m , n W, nj_ ttt , false); 

end if 
endcase 
case 3F: 

| Full_Block(i? M , J M , D M , W [q \ n lq] , n lq] , max it) 
endcase 
case SB: 

if first_step then 

| Block_Oriented(i? M , J M , D M , W [q] , n lq] , n lq] , 1); first_ step = false; 
else 

| Off .Diagonal (flW, J lq] , D M , If 1 ' 1 , n 1 ' 1 , 7i;_ 6ifc , n 3 _ wjt ); 
end if 
endcase 
endswitch 

Q[l] = Qll] . W [q\. 

//calculate send/receive indices, for example by MMStep ; 
MMStep (rank, nbl, nproc, nsweep, ip, jp, i_blk, j_blk, snd_rnk, snd_blk, rcv_rnk, 
rcv_ blk); 

send G S nd_ b ik, D sn d_ b i k , and J sn d_blk,snd_blk to snd_rnk; 
receive G rcv _uk, D rcv _ bUc , and J r cv_bik,rcv_bik from rcv_rnk; 
exchange convergence informations; 
until convergence; 

eigenvectors: normalized columns of G' g '; 

eigenvalues: squared norms of columns of G^ multiplied by signs from J [q] ; 



end 
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All machines are running 64-bit GNU/Linux (Red Hat on A, Debian on B and C). 

Due to operating system's process scheduler tendency to occasionally reschedule a pro- 
cess to a different core or even processor socket from one it was running on, what induces 
unwanted cache spills and possibly issues of non-local memory accesses NUMA architectures 
are suffering from, a simple scheduling protocol is devised to bind processes permanently 
during a test run to processor cores, in a one-to-one manner. Very similar features have 
recently been implemented in Open MPI version 1.3, and since then have been used in our 
testing instead. 

Initial tests on machine B have exhibited severe timing inconsistencies between test runs 
of the same algorithm over identical data sets, differing by a factor of up to 2.5. Repeated 
identical runs of processes pinned to cores our protocol (or, lately, Open MPI) dedicated for 
them, however, differ by no more than 1-2% on machine B, and by just a few seconds on 
machines A and C. 

6.2 The test results 

A production setup of the three-level J- Jacobi algorithms should begin with the careful profil- 
ing of the corresponding (full block or block-oriented) sequential blocked J-Jacobi algorithm, 
to reveal the optimal inner block size for the particular hardware. Our testing has shown 
that there exists a global optimum over all inner block sizes for the given machine and the 
problem size. The optimal block size is expected to be a slowly growing function of the 
problem size on any modern processor architecture. 

Similarly, the decision between the full block and block-oriented three level algorithms is 
platform and problem type dependant. The testing suggests that for smaller problems and in 
an environment where interprocess communication is relatively slow (e.g., Ethernet) the full 
block algorithm is a method of choice. For larger problems and with a fast communication 
available (e.g., local shared memory) the block-oriented algorithm wins. It is up to the 
implementor, however, to correctly identify the adequate algorithm for the targeted machines 
and spectra of problem sizes. 

Our test results show that the modified round-robin strategy is always slower than the 
modulus strategy. On machine A, it is about 3% slower on average, while on machine B, 
for very large matrices, it can be even two times slower. Therefore, we shall present more 
detailed test results only for the modulus strategy. 




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
dimension /1000 



Figure 9: Modulus strategy: measured time for two-level full block method (2F), for p = 
16, . . . , 40 cores. 
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Figure 10: Modulus strategy: measured time for two-level block-oriented method (2B), for 
p = 16, ... ,40 cores. 

As we expected, a significant speedup is obtained if we compare three-level versus two- 
level algorithms. In the case of modulus pivot strategy, the ratio between two- and three- level 
algorithm is approximately 30% for sufficiently large matrices. From Figure [Ol it is clear that 
the usage of three-level method is advisable (on machine ^4) if the order of matrix is 9000 or 
more. 

In the case of block-oriented strategy (see Figure IT4"f , the ratio between two- and three- 
level methods (on machine A) is approximately 55% for large matrices. 




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
dimension/1000 



Figure 11: Modulus strategy: measured time for three- level full block method (3F), for 
p = 16, . . . , 40 cores. 

The previous two graphs show us only that by usage of blocking we obtained the significant 
speedup, but they do not show the true relation between speeds of these two methods. 
Finally, if we compare times for various algorithms (see Figures I MTS"]) , we can conclude that 
the modulus version of the three-level block-oriented algorithm is the fastest one. 
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Figure 12: Modulus strategy: measured time for three-level block-oriented method (3B), for 
p = 16, ... ,40 cores. 
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Figure 13: The ratio of times for three- level full block and two-level full block method 
(modulus pivot strategy), p = 16 cores. 
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Figure 14: The ratio of times for three-level block-oriented and two-level block-oriented 
method (modulus pivot strategy), p — 16 cores. 



6.3 A simple time model 

From the theoretical viewpoint, complexity of the sequential Jacobi algorithm is proportional 
to c„ • n 3 . The algorithm is almost ideally paralellizable, and it can be divided into p almost 
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equally sized processes. Therefore the expected execution time should be 

n 3 

Time(n,p) = c — , 
P 

where c is a parameter that depends on the chosen algorithm, n, p. and the speed of the 
interprocess communication. 
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Figure 15: Modulus strategy: parameter c for 2F, Time(n,p) = c— . 
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Figure 16: Modulus strategy: parameter c for 2B, Time(n,p) = c— . 



We really cannot expect that c is a constant. For example, the number of sweeps is a 
slowly growing function of n. Nevertheless, the experiments show that c is almost constant 
in the middle range, when neither communication, nor the block size inside each core are the 
dominant factors. Figures IT5H18I show that phenomenon. 

Finally, for a more accurate time prediction model we need significantly larger test space 
and a precise model that captures subtle properties of the processor, cache, memory and 
network behavior. 
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Figure 17: Modulus strategy: parameter c for 3F, Time(n,p) = c- 
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Figure 18: Modulus strategy: parameter c for 3B, Time(n,p) = c 2 ^-. 



Conclusion 

In this paper we have introduced the two- and three-level parallel hyperbolic Jacobi algo- 
rithms. These types of organizations of parallel sweeps can make diagonalization process 
significantly faster for the large enough matrices. The convergence of the algorithms (with 
modulus block strategy and cyclic by rows/columns inside blocks) is a direct consequence of 
the convergence of the row/column cyclic block algorithms |17l 118) . 

Our bunch of algorithms permits fine tuning for intricacies of the targeted computer archi- 
tecture. The choice of block-oriented or full block algorithm depends on the communication 
speed between processes — the full block algorithms are suitable for the slower speed, and 
block-oriented for the faster. Once the crossings between two- and three-level algorithms 
are found on the particular architecture, the three-level algorithms can be additionaly finely 
tuned — the size of inner blocks should be determined for optimal performance. 

The testing results show almost perfect scalability (see Figures [T5HT8")) . in a sense that c 
is really a constant for the middle ranges of n and p. 
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